Abstract:
Data analysts often need to transform an existing dataset, for example by filtering, into a new dataset for downstream analysis. Even the most trivial of mistakes in this phase can introduce bias and lead to invalid conclusions. For example, consider a researcher identifying subjects for trials of a new statin drug. She might identify patients with a high dietary cholesterol intake as a population likely to benefit from the drug; however, selecting these individuals could bias the test population toward those with a generally unhealthy lifestyle, thereby compromising the analysis. Reducing the potential for bias in the dataset transformation process can minimize the need to later engage in the tedious, time-consuming process of trying to eliminate bias while preserving the target dataset. We propose a novel interaction model for explain-and-repair data transformation systems, in which users interactively define constraints for transformation code and the resultant data. The system satisfies these constraints as far as possible and provides an explanation for any problems encountered. We present an algorithm that yields filter-based transformation code satisfying user constraints. We implemented and evaluated a prototype of this architecture, EMERIL, using both synthetic and real-world datasets. Our approach finds solutions 34% more often and 77% more quickly than the previous state-of-the-art solution.
Abstract:
Nowcasting is the practice of using social media data to quantify ongoing real-world phenomena. It has been used by researchers to measure flu activity, unemployment behavior, and more. However, the typical nowcasting workflow requires either slow and tedious manual searching of relevant social media messages or automated statistical approaches that are prone to spurious and low-quality results. In this paper, we propose a method for declaratively specifying a nowcasting model; this method involves processing a user query over a very large social media database, which can take hours. Due to the human-in-the-loop nature of constructing nowcasting models, slow runtimes place an extreme burden on the user. Thus, we also propose a novel set of query optimization techniques, which allow users to quickly construct nowcasting models over very large datasets. Further, we propose a novel query quality alarm that helps users estimate phenomena even when historical ground truth data is not available. These contributions allow us to build a declarative nowcasting data management system, RaccoonDB, which yields high-quality results in interactive time. We evaluate RaccoonDB using 40 billion tweets collected over five years. We show that our automated system saves work over traditional manual approaches while improving result quality (57% more accurate in our user study), and that its query optimizations yield a 424x speedup, allowing it to process queries 123x faster than a 300-core Spark cluster, using only 10% of the computational resources.
Abstract:
Social media nowcasting, the process of estimating real-world phenomena from social media data, has grown in popularity over the last several years as an alternative to traditional data collection methods like phone surveys. Unfortunately, current nowcasting methods depend on pre-existing, traditionally collected survey data as an aid to sift through the huge number of signals that can be derived from social media. This dependence severely limits the applicability of current nowcasting techniques. If we could remove this need for conventional data, social media signals could describe a much wider range of target phenomena. We have built a nowcasting query system that estimates real-world phenomena without requiring any conventional data, relying instead on interactive exploration by users. Specifically, our system exploits a user-provided multi-part query consisting of semantic and signal components. The user can explore in real time the tradeoff between these two components to find the most relevant social media signals for estimating the target phenomenon. Our demonstration system lets users search for signals within a large Twitter corpus using a dynamic web-based interface. Users can also share results with the general public, review and comment on others' shared results, and clone these results as starting points for further exploration and querying.
Abstract:
Social media nowcasting, which uses online user activity to describe real-world phenomena, is an active area of research to supplement more traditional and costly data collection methods such as phone surveys. Given the potential impact of such research, we would expect general-purpose nowcasting systems to quickly become a standard tool among non-computer scientists, yet nowcasting has largely remained a research topic. We believe a major obstacle to widespread adoption is the nowcasting feature selection problem. Typical nowcasting systems require the user to choose a handful of social media objects from a pool of billions of potential candidates, which can be a time-consuming and error-prone process. We have built Ringtail, a nowcasting system that helps the user by automatically suggesting high-quality signals. We demonstrate that Ringtail can make nowcasting easier by suggesting relevant features for a range of topics. The user provides just a short topic query (e.g., unemployment) and a small conventional dataset in order for Ringtail to quickly return a usable predictive nowcasting model.